Baby names
Baby names data
Let’s load the packages that we need, and also specify that we want to use theme_minimal() for all plots.
library("tidyverse")
library("babynames")
library("knitr")
theme_set(theme_minimal(14))
The babynames package provides a data set (also called babynames) that we’re going to work with here. Let’s view the first 10 rows as usual:
kable(head(babynames, n = 10))

| year | sex | name | n | prop |
|---|---|---|---|---|
| 1880 | F | Mary | 7065 | 0.0723836 |
| 1880 | F | Anna | 2604 | 0.0266790 |
| 1880 | F | Emma | 2003 | 0.0205215 |
| 1880 | F | Elizabeth | 1939 | 0.0198658 |
| 1880 | F | Minnie | 1746 | 0.0178884 |
| 1880 | F | Margaret | 1578 | 0.0161672 |
| 1880 | F | Ida | 1472 | 0.0150812 |
| 1880 | F | Alice | 1414 | 0.0144870 |
| 1880 | F | Bertha | 1320 | 0.0135239 |
| 1880 | F | Sarah | 1288 | 0.0131961 |
Fun with filtering
Problem 1: Explaining an odd line graph
Let’s attempt to filter the babynames dataset to get only names of people in this class. Note the use of the %in% operator, which is a convenient way of checking if some value belongs to a given set. It’s a cleaner alternative to something like name == "Fernando" | name == "Tara" | name == "Colby" ....
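As a quick sanity check on a toy vector (the names below are only illustrative, and "Alex" is a made-up non-member), `%in%` and the chained `==` comparisons should select the same elements:

```r
# Toy vector of names; "Alex" is a hypothetical value that is not in the set
name <- c("Tara", "Alex", "Colby", "Fernando")

# Membership test with %in%
with_in <- name %in% c("Fernando", "Tara", "Colby")

# Equivalent chain of == comparisons joined with |
with_or <- name == "Fernando" | name == "Tara" | name == "Colby"

# Both approaches should flag the same elements
identical(with_in, with_or)
```

The `%in%` form stays the same length no matter how many names are in the set, which is why it scales better than the chained version.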
this_class <- babynames |>
  filter(name %in% c("Tara", "Colby", "Evan", "Kalei",
                     "Peyton", "Sydney", "Bright", "Fernando"))

Below, we try to make a simple line graph with year on the x axis and prop on the y axis, with a different color for each name. You should get a plot that looks weird.
Why is this happening? (Hint: inspect the data for “Sydney” in the year 2000)
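For reference, a minimal sketch of the line graph described in this problem. The exact plotting code is an assumption based on the description (year on x, prop on y, color by name), and the object name `p` is introduced here only for illustration:

```r
library(dplyr)
library(ggplot2)
library(babynames)

# Same filter as above: keep rows whose name is in the class list
this_class <- babynames |>
  filter(name %in% c("Tara", "Colby", "Evan", "Kalei",
                     "Peyton", "Sydney", "Bright", "Fernando"))

# Line graph: year on x, proportion on y, one color per name
# (assumed form of the plot; p is a hypothetical name)
p <- ggplot(this_class, aes(x = year, y = prop, color = name)) +
  geom_line()
p
```

Reusing this block unchanged after modifying the filter makes it easy to compare the two versions of the plot.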
Problem 2: Fix the filter
Modify the filter statement above to obtain a data set of names for this class that does not have this problem. Use the exact same ggplot code as above to verify that the problem is gone. Recall that the “or” operator in R is | and the “and” operator in R is &.
Answer the following question:
- Describe (in words) the logic of this modified filter() statement (i.e., what rows are retained in the data set?). How did this solve our problem?
Line graphs
Problem 3: Trends over time
If we’re interested in trends for each name over time rather than relative popularity, the previous graph is of limited use. In particular, the popularity of some names in the 20th and early 21st century makes it hard to see the trends over time for some of the less-common names.
Show the same data as small multiples with the following characteristics:
- Each name has its own separate small panel.
- Each small panel has its own y-axis that is scaled to the popularity of the name.
- Remove color so that all lines are simply black.
Problem 4: Ribbons
We can emphasize the temporal trends more effectively by filling the area under each curve. We can do this using geom_ribbon(). Try to add a ribbon to the previous plot with 50% transparency (alpha = 0.5). To get the aesthetic mappings right, it may be helpful to read the help page for geom_ribbon().
Problem 5: Adding color and fill
Although color would be redundant here, let’s bring it back in to get some practice with ribbons. Try mapping fill and/or color to name to get the plot to look exactly like the one I made in the solutions file. Note in particular that there is no solid outline on the sides and bottom! You can try doing this globally or in the geom functions, but not all configurations will produce the desired effect. The color/fill legend is pointless, so let’s turn it off by adding this line to the plot: guides(color = "none", fill = "none").
A pairwise comparison
Problem 6: Two lines with labels
If you’re only plotting a few lines, then plotting them together rather than in small multiples allows you to compare the trends more easily. Directly labeling the line with the name is cognitively easier to process than matching up a color with a legend. Let’s try making a line graph for the names “Sydney” and “Peyton” only, with names labeling the lines. Here’s the code to do that:
sydney_peyton <- babynames |>
  filter((name == "Sydney" & sex == "F") | (name == "Peyton" & sex == "F"))
ggplot() +
  geom_line(data = sydney_peyton, aes(x = year, y = prop, color = name)) +
  geom_text(data = filter(sydney_peyton, year == max(year)),
            aes(x = year, y = prop, color = name, label = name), hjust = 0, nudge_x = 1) +
  guides(color = "none") +
  expand_limits(x = 2025) +
  labs(x = "Year", y = "Proportion")

Answer the following questions about this code and its plot:
- Describe the data set used for the geom_text() function. How does the modified filter statement accomplish what we need? How many rows does this subset have?
- What code here is keeping the text from overlapping with the line?
- Try deleting hjust = 0 (the default is hjust = 0.5), or replacing it with hjust = 1. What does this option do?
- View the help file for expand_limits() and describe how it improves our plot.
Problem 7: geom_path()
There’s another similar geom called geom_path(). To see how it differs from geom_line(), we’re going to sort our data set by the column prop. This is our first introduction to a lovely dplyr function called arrange(). Like other tidyverse functions that you have encountered, the function takes a data set as its first argument. After that, we simply give it the column that we want to sort by. In this case, we’re sorting the rows by popularity of that year/name/sex combination, from least to greatest.
sydney_peyton_ordered <- sydney_peyton |>
  arrange(prop)

Now, we create a plot using the ordered data.
# Lines
ggplot() +
  geom_line(data = sydney_peyton_ordered, aes(x = year, y = prop, color = name)) +
  geom_text(data = filter(sydney_peyton_ordered, year == max(year)),
            aes(x = year + 1, y = prop, color = name, label = name), hjust = 0) +
  guides(color = "none") +
  expand_limits(x = 2025) +
  labs(x = "Year", y = "Proportion", title = "geom_line()")

# Paths
ggplot() +
  geom_path(data = sydney_peyton_ordered, aes(x = year, y = prop, color = name)) +
  geom_text(data = filter(sydney_peyton_ordered, year == max(year)),
            aes(x = year + 1, y = prop, color = name, label = name), hjust = 0) +
  guides(color = "none") +
  expand_limits(x = 2025) +
  labs(x = "Year", y = "Proportion", title = "geom_path()")

Answer the following questions about this code and its plot:
- Describe in detail the difference between geom_line() and geom_path(). Why did the reordering have no effect on one of the geoms?
- If you were plotting an animal’s GPS location data on x and y axes to show its movement, which geom would produce the right plot?
A case study
Problem 8: Highlighting regions using annotations
Finally, let’s practice adding annotations to plots. An annotation is a graphical element, like a point, line, box, or text label, that is added to a plot but is not part of your data set. Annotations can be added using the annotate() function. Note that writing \n in the text forces a line break.
We’re going to plot the popularity over time of the name Lionel for boys, and annotate some interesting features with both rectangles and text. Here’s a template:
ggplot(filter(babynames, name == "Lionel" & sex == "M"),
       aes(x = year, y = prop)) +
  geom_ribbon(aes(ymax = prop, ymin = 0), alpha = 0.5) +
  geom_line() +
  annotate(geom = "rect", xmin = 1950, xmax = 1960, ymin = 0, ymax = Inf,
           fill = "red", alpha = 0.25) +
  annotate(geom = "text", x = 1949, y = 3e-4, label = "Some text\nMore text",
           color = "red", hjust = 1) +
  labs(x = "Year", y = "Proportion", title = "Popularity of the name Lionel")

Using this template, highlight and label the following notable periods in the history of the name Lionel.
- 1931 – 1936: “Peak popularity in early 30s” (I used color “seagreen3”)
- 1982 – 1986: “Lionel Richie’s 3 best-selling albums” (I used color “coral”)
- 2004 – 2017: “Lionel Messi’s career with FC Barcelona” (I used color “skyblue2”)
Life tables
Getting sick of hearing about babies? Let’s switch gears and talk about DEATH.
The babynames package also includes cohort life tables based on data from the US Social Security Administration. What are life tables? They’re demographic tools that are used to analyze death rates and calculate life expectancies at various ages and across time.
Let’s get a look at the data:
kable(head(lifetables))

| x | qx | lx | dx | Lx | Tx | ex | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.14596 | 100000 | 14596 | 90026 | 5151511 | 51.52 | M | 1900 |
| 1 | 0.03282 | 85404 | 2803 | 84003 | 5061484 | 59.26 | M | 1900 |
| 2 | 0.01634 | 82601 | 1350 | 81926 | 4977482 | 60.26 | M | 1900 |
| 3 | 0.01052 | 81251 | 855 | 80824 | 4895556 | 60.25 | M | 1900 |
| 4 | 0.00875 | 80397 | 703 | 80045 | 4814732 | 59.89 | M | 1900 |
| 5 | 0.00628 | 79693 | 501 | 79443 | 4734687 | 59.41 | M | 1900 |
Columns of interest:
- x = Age in years.
- qx = Mortality rate at age x, basically the probability of dying during that age interval.
- ex = Expectation of further life for an individual of age x.
- year = Birth cohort, a hypothetical population of 100,000 individuals born in that year.
- sex = You can probably figure this one out.
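The columns lx, dx, and Tx are not described above. Under the usual life-table conventions (an assumption here, not something stated in this worksheet), lx is the number of survivors at exact age x, dx is the number of deaths during the interval (about lx × qx), and ex is about Tx / lx. A minimal sketch that checks these relationships against the six rows printed above, allowing for rounding:

```r
# The six rows shown above (males, 1900 birth cohort), typed in by hand
lt <- data.frame(
  x  = 0:5,
  qx = c(0.14596, 0.03282, 0.01634, 0.01052, 0.00875, 0.00628),
  lx = c(100000, 85404, 82601, 81251, 80397, 79693),
  dx = c(14596, 2803, 1350, 855, 703, 501),
  Tx = c(5151511, 5061484, 4977482, 4895556, 4814732, 4734687),
  ex = c(51.52, 59.26, 60.26, 60.25, 59.89, 59.41)
)

# Deaths in each interval should be roughly survivors times mortality rate
# (assumed relationship: dx = lx * qx)
dx_check <- lt$lx * lt$qx
max(abs(dx_check - lt$dx))   # largest discrepancy, expected to be rounding only

# Remaining life expectancy should be roughly Tx divided by lx
# (assumed relationship: ex = Tx / lx)
ex_check <- lt$Tx / lt$lx
max(abs(ex_check - lt$ex))   # largest discrepancy, expected to be rounding only
```

Seeing qx as the link between lx and dx may help when interpreting the mortality-rate plots in the next problem.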
Problem 9: Cohort mortality rates by sex
Let’s start by making two plots of the mortality rate (qx) against age, and you can use whatever colors you want.
In the first plot, stratify (color) by birth cohort (year) and facet by sex. Note that because year is coded as a continuous value, you’ll want to convert it to a character or factor when plotting (use as.character(year) when it appears in your code).
In the second plot, stratify by sex and facet by year.
Probabilities of death get quite high at very old or young ages, so let’s zoom in on the plot using coord_cartesian() with x and y limits (just add this to your plots):
coord_cartesian(xlim = c(0, 80), ylim = c(0, 0.025))
Answer the following questions about this code and its plot:
- What broad patterns are better illustrated in the first plot?
- What broad patterns are better illustrated in the second plot?
- What unusual features do you see (in either plot), and what could explain them?
Problem 10: Change in life expectancies by sex
How has life expectancy at birth changed over time, and how does this differ between males and females? To visualize this, plot the life expectancy at age 0 only for each birth cohort, using both points and lines, and color by sex. This time, don’t set limits for x or y.
Answer the following questions about this code and its plot:
- How has life expectancy changed between 1900 and 2010?
- How do life expectancies differ between males and females?
- Has the male-female life-expectancy gap closed, widened, or stayed the same over the last century or so?